Analysing @WeRateDogs

Table of Contents

Introduction

This paper presents the data wrangling and exploratory data analysis work performed on several datasets from WeRateDogs' @dog_rates Twitter account. This account rates images of dogs, adding a humorous comment about each dog.

The work is based on a primary file, sourced from Twitter and provided by Udacity; this file, however, does not contain all of the information from the tweets. The second file is the output of a machine learning model based on an image recognition neural network, giving the breed of each dog. Finally, a third file is obtained directly from the Twitter API, queried using the ID of each tweet in the primary file.

Objectives


This analysis will seek to answer the following questions.

  1. What are the most common dog breeds on WeRateDogs?
  2. What are the most common names?
  3. Which breeds achieve the highest ratings?
  4. Which breeds get the most reactions (Retweets and Likes)?
  5. Is there a relationship between number of reactions and rating?

Data Wrangling

Gather

In the gather stage, all the data needed for the work is obtained. This involves interfacing with different sources; for this study we read data from TSV and CSV text files, as well as directly from the Twitter API.
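The two text-file reads can be sketched with pandas; the commented file names are assumptions based on the project description, and the in-memory frames below stand in for the real files so the snippet is self-contained:

```python
import io
import pandas as pd

# Assumed file names -- adjust to the actual project files:
# archive = pd.read_csv('twitter-archive-enhanced.csv')          # CSV file
# predictions = pd.read_csv('image-predictions.tsv', sep='\t')   # TSV file

# Self-contained demonstration with in-memory data:
csv_data = io.StringIO("tweet_id,name\n1,Charlie\n2,Cooper\n")
tsv_data = io.StringIO("tweet_id\tp1\n1\tgolden_retriever\n")
archive = pd.read_csv(csv_data)                  # comma-separated
predictions = pd.read_csv(tsv_data, sep='\t')    # tab-separated
```

The only difference between the two reads is the `sep` argument, which tells pandas the field delimiter.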

Read data from Udacity (provided by @WeRateDogs)

Read data from Image Predictions File

Read data from Twitter API

Next, the Twitter API is used to query detailed information about the tweets: the ID of each tweet is used to request its details. The data is returned in JSON format and stored in the tweet_json.txt file.
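Reading the stored responses back can be sketched as follows, assuming tweet_json.txt holds one JSON object per line (the field names below are illustrative, not the full API payload):

```python
import io
import json
import pandas as pd

# Stand-in for open('tweet_json.txt'): one JSON object per line.
sample = io.StringIO(
    '{"id": 1, "retweet_count": 10, "favorite_count": 50}\n'
    '{"id": 2, "retweet_count": 3, "favorite_count": 20}\n'
)
# Parse each non-empty line and collect the records into a dataframe.
records = [json.loads(line) for line in sample if line.strip()]
api_df = pd.DataFrame(records)
```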

Because the flattened data source contains 326 columns and is subsequently joined to another dataframe, columns are selected on the basis of their usefulness to the analysis, in order to "slim down" this dataframe.
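The flattening and column selection can be illustrated with `pd.json_normalize`; the toy record below is a small stand-in for a real API response, whose deep nesting is what produces the 326 columns:

```python
import pandas as pd

# A single nested tweet record; real responses nest far deeper.
tweet = {"id": 1, "retweet_count": 10, "favorite_count": 50,
         "user": {"id": 99, "followers_count": 1000}}
flat = pd.json_normalize([tweet])   # nested keys become dotted column names
# Keep only the columns useful to the analysis.
keep = ['id', 'retweet_count', 'favorite_count']
slim = flat[keep]
```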

By placing the file from the API on the left side of the merge, the resulting dataframe has twenty-five fewer tweets; for the purposes of this study, this is beneficial. This may seem paradoxical, but the missing tweets no longer exist, probably deleted by the users themselves, so leaving them out of the analysis keeps the results up to date.

The dataframes are merged with a left join, adding the breed prediction data to the data coming from Twitter. Note, however, that only the breed the neural network predicted with the highest confidence has been retained.
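The left join can be sketched with toy frames (column names such as `tweet_id` and `p1` are assumptions about the real files):

```python
import pandas as pd

# API data on the left, breed predictions on the right.
api = pd.DataFrame({'tweet_id': [1, 2, 3],
                    'favorite_count': [50, 20, 30]})
preds = pd.DataFrame({'tweet_id': [1, 2, 3, 4],
                      'p1': ['golden_retriever', 'pug', 'basset', 'orange']})
# Left join: rows absent from the API frame (e.g. deleted tweets) are dropped.
merged = api.merge(preds, on='tweet_id', how='left')
```

Because the API frame drives the join, prediction rows for tweets that no longer exist never enter the result.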

Assess

Once the data has been collected in a single dataframe, the process continues with its evaluation.

In the assess stage, the quality and tidiness of the data are evaluated.

From the quality point of view:

In terms of tidiness, the dataset is expected to be neat and tidy, complying with:

After visually and programmatically reviewing the dataset, it can be affirmed that:

Clean

This stage of the data wrangling process involves improving the quality of the data, based on the observations detected in the assess stage.

Creating a copy of the dataframe for the cleaning process

Resolving quality issues

1. Correcting null values for 'expanded_urls'

Define
Code
Test

2. Correcting non-descriptive names

Define

There are a set of columns with non-descriptive names

Code
Test

3. Clean 'source' column

Define

The column called 'source' indicates which Twitter client software the post was made from; however, this value comes wrapped in an HTML anchor tag and needs to be cleaned up.
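A minimal sketch of the clean-up, using a regular expression to keep only the visible text between the anchor tags:

```python
import pandas as pd

df = pd.DataFrame({'source': [
    '<a href="http://twitter.com/download/iphone" rel="nofollow">'
    'Twitter for iPhone</a>',
]})
# Capture the text between '>' and '<'; expand=False returns a Series.
df['source'] = df['source'].str.extract(r'>([^<]+)<', expand=False)
```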

Code
Test

4. Correcting strange dog breed values

Define

Some strange values are observed for dog breeds, such as 'orange', 'paper_towel' and 'basset', among others. However, each breed prediction is accompanied by a boolean value indicating whether the prediction actually corresponds to a dog.
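Assuming the prediction column is `p1` and its boolean flag is `p1_dog` (the usual names in this dataset, but an assumption here), the non-dog predictions can be blanked out like this:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'p1': ['orange', 'paper_towel', 'basset'],
    'p1_dog': [False, False, True],   # did the model actually see a dog?
})
# Blank out predictions the model flagged as not being a dog.
df.loc[~df['p1_dog'], 'p1'] = np.nan
```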

Code
Test

5. Dropping retweets

Define

The statement of this study asks not to consider retweets. After reviewing the dataset, it has been observed that the column 'retweeted_status_id' contains values just when the record corresponds to a retweet. Therefore, the valid values according to the statement contain a null value in this column.
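The filter described above can be sketched as follows (toy data; the real column is 'retweeted_status_id' as stated):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    # A non-null value here marks the record as a retweet.
    'retweeted_status_id': [np.nan, 9.2e17, np.nan],
})
# Keep only original tweets: rows where the retweet ID is null.
originals = df[df['retweeted_status_id'].isnull()].copy()
```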

Code
Test

6. Correcting failed breed predictions

Define

An exhaustive review of the cases with no breed prediction information was carried out; in all cases they correspond to tweets without images. According to the requirements of this project, these records are not considered in the analysis.

Code
Test

7. Cleaning outliers in 'rating_numerator'

Define

Outliers are detected in the column 'rating_numerator'.
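One common way to fix these outliers (an assumption here, not necessarily the exact method used) is to re-extract the rating from the tweet text, since many outliers come from decimal ratings being mis-parsed:

```python
import pandas as pd

df = pd.DataFrame({
    'text': ['This is Doug. 13/10 would pet',
             'Meet Sam. 9.75/10 such grace'],
    'rating_numerator': [13, 75],   # 75 was mis-parsed from the decimal
})
# Re-extract the numerator, this time allowing decimals.
df['rating_numerator'] = (df['text']
                          .str.extract(r'(\d+\.?\d*)/10', expand=False)
                          .astype(float))
```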

Code
Test

8. Cleaning outliers in 'rating_denominator'

Define

Outliers are detected in the column 'rating_denominator'.

Code
Test

9. Dropping repeated and null columns

Define

Due to successive merges, there are some repeated columns, such as 'tweet_id_x' and 'tweet_id_y', while 'id' already exists. Likewise, 'created_at' and 'timestamp' carry the same information in different formats; for that reason 'timestamp' is kept, but under a more readable name.

At the same time, there are columns whose values are all null: 'retweeted_status_id', 'retweeted_status_user_id' and 'retweeted_status_timestamp'.

Finally, it is also decided to remove the columns 'in_reply_to_status_id' and 'in_reply_to_user_id'; they contain only 23 non-null values and do not answer any of the questions targeted by the analysis to be carried out.
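The drop-and-rename step can be sketched as below; 'tweet_date' is a hypothetical replacement name, since the report does not state which name 'timestamp' was given:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2],
    'tweet_id_x': [1, 2],          # duplicated by successive merges
    'tweet_id_y': [1, 2],
    'created_at': ['2017-01-01', '2017-01-02'],
    'timestamp': ['2017-01-01', '2017-01-02'],
    'retweeted_status_id': [np.nan, np.nan],   # all null after the filter
})
df = df.drop(columns=['tweet_id_x', 'tweet_id_y', 'created_at',
                      'retweeted_status_id'])
# Hypothetical readable name for the kept 'timestamp' column.
df = df.rename(columns={'timestamp': 'tweet_date'})
```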

Code
Test

Tidying Dataset

1. Create a column to store the size of each dog.

Define

At this moment, the size of the dog is stored in four columns, one for each possible size. This information is now consolidated into a single column.
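A minimal sketch of the consolidation, assuming the four size columns use the string 'None' as a placeholder (as they do in the original archive) and the new column is called 'dog_size' (a hypothetical name):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'doggo':   ['doggo', 'None', 'None'],
    'floofer': ['None', 'None', 'None'],
    'pupper':  ['None', 'pupper', 'None'],
    'puppo':   ['None', 'None', 'None'],
})
size_cols = ['doggo', 'floofer', 'pupper', 'puppo']
# Take the first non-'None' value in each row; rows with none become NaN.
df['dog_size'] = (df[size_cols]
                  .replace('None', np.nan)
                  .bfill(axis=1)
                  .iloc[:, 0])
```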

Code
Test

2. Separating 'expanded_urls' to tidy the dataset

Define

The column 'expanded_urls' may contain more than one element per cell. It is now separated into one URL per column.
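The split can be sketched as follows (the 'url_1', 'url_2', … column names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'expanded_urls': [
    'https://twitter.com/dog_rates/status/1',
    'https://twitter.com/dog_rates/status/2,'
    'https://twitter.com/dog_rates/status/2',
]})
# One URL per column; rows with fewer URLs are padded with None.
urls = df['expanded_urls'].str.split(',', expand=True)
urls.columns = [f'url_{i + 1}' for i in range(urls.shape[1])]
df = df.drop(columns='expanded_urls').join(urls)
```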

Code
Test

Saving dataframe
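Persisting the cleaned dataframe is a one-liner; 'twitter_archive_master.csv' is an assumed file name (the snippet writes to a temporary directory so it is self-contained):

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'tweet_id': [1], 'rating_numerator': [13.0]})
# Assumed output name; index=False keeps the row index out of the file.
path = os.path.join(tempfile.gettempdir(), 'twitter_archive_master.csv')
df.to_csv(path, index=False)
```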

3. Merge information about one type of observational unit from three different data sources.

Define

Information about one type of observational unit (tweets) is spread across three different data sources. This information was merged earlier in the process, producing a single dataframe.

Exploratory Data Analysis

With a clean and archived dataset, it is possible to proceed with exploratory analysis of these data.

Most common breeds

The most common breeds are the Retrievers: Golden Retriever with 12.1% and Labrador Retriever with 8.08% are the most frequent; together they account for approximately 20%.
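Breed shares like these come from a normalized frequency count; a toy sketch (the 12.1% and 8.08% figures above come from the real data, not this sample):

```python
import pandas as pd

breeds = pd.Series(['golden_retriever', 'golden_retriever', 'golden_retriever',
                    'labrador_retriever', 'pug'])
# normalize=True gives fractions; multiply by 100 for percentages.
shares = breeds.value_counts(normalize=True) * 100
```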

Charlie, Cooper and Oliver are the most common names of the dogs in WeRateDogs.

High rated breeds

Eskimo_dog, Samoyed and Cardigan are the top-rated breeds on the @dog_rates Twitter account.

Breeds preferred by users

The breeds that receive the most reactions from Twitter users are: French Bulldog, Cardigan and Basset.

In this graph, each bubble represents one breed. Its size indicates the number of records (rows) for that breed, the x-axis shows the rating given by WeRateDogs, and the y-axis the reactions by Twitter users. With this information it is possible to state that the Golden Retriever is the most common breed and is valued by users, although not as highly by WeRateDogs: while they rate it very well, it is not at the top. WeRateDogs gives higher ratings to breeds such as the Saluki, Afghan Hound and Giant Schnauzer.

Relationship between rating and reactions

The graph shows no major relationship between the preferences of WeRateDogs and those of the users. Although the fitted line has a positive slope, it is not steep, and its $r^2$ value is low, making the line unrepresentative and indicating little relationship between the two.
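The slope and $r^2$ behind a plot like this can be computed as below; the rating/reaction pairs are synthetic, purely to illustrate the calculation:

```python
import numpy as np

# Synthetic data standing in for the real rating / reaction columns.
rating = np.array([10, 11, 12, 13, 14, 12, 11], dtype=float)
reactions = np.array([2000, 2500, 2400, 5000, 3000, 2600, 2200], dtype=float)
# Degree-1 polynomial fit gives the regression line.
slope, intercept = np.polyfit(rating, reactions, 1)
# r² is the squared Pearson correlation coefficient.
r = np.corrcoef(rating, reactions)[0, 1]
r_squared = r ** 2
```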

Conclusions

This work consisted mainly of the application of Data Wrangling techniques. Data Wrangling involves three main stages: Gather, Assess and Clean; all of them were applied in this study.

In addition to Data Wrangling, Exploratory Data Analysis was carried out.

This study consisted of the analysis of the tweet archive of Twitter user @dog_rates, also known as WeRateDogs.

A total of three data sources were considered. The first, a file in CSV format referenced in this paper as 'primary', is provided by Udacity; it contains incomplete information on the tweets to be analysed. The second file, in TSV format, contains the breed identification of each dog published in the CSV file; this identification was performed by machine learning, and the file is also provided by Udacity. The third data source was generated by this same student from the Twitter API: the ID of each tweet is sent and its detail is received in JSON format, which allows the missing information in the primary file to be completed.

The following data quality problems were identified and corrected:

  1. Non-descriptive name of column 'p1'.
  2. Non-descriptive name of column 'p1_conf'.
  3. Non-descriptive name of column 'p1_dog'.
  4. Filtered out tweets that correspond to retweets, since the project statement requests that these not be considered.
  5. Corrected unreported values for 'expanded_urls'.
  6. Corrected null values for the columns corresponding to dog size ('doggo', 'floofer', 'pupper', 'puppo').
  7. Corrected values for dog breeds that are not actually breeds but other English words; apparently objects that the machine learning model identified instead of the dog's breed.
  8. Corrected outliers in the column 'rating_numerator'.
  9. Corrected outliers in the column 'rating_denominator'.
  10. Improved readability of the 'source' column.

The following tidiness issues were identified and corrected:

  1. Dog sizes correspond to a single variable, the size of the dog, yet were reported across several columns; they were consolidated into one.
  2. The column 'expanded_urls', which contains several URLs in the same cell, is separated to leave one in each column.

As for the questions initially raised, they can now be answered:

  1. What are the most common dog breeds on WeRateDogs?
    R: The most common breeds are the Retrievers: Golden Retriever with 12.1% and Labrador Retriever with 8.08% are the most frequent; together they account for approximately 20%.
  2. What are the most common names?
    R: Charlie, Cooper and Oliver are the most common dog names on WeRateDogs.
  3. Which breeds achieve the highest ratings?
    R: Eskimo_dog, Samoyed and Cardigan are the top-rated breeds on the @dog_rates Twitter account.
  4. Which breeds get the most reactions (Retweets and Likes)?
    R: The breeds that receive the most reactions from Twitter users are: French Bulldog, Cardigan and Basset.
  5. Is there a relationship between number of reactions and rating?
    R: Based on the information available, it can be stated that WeRateDogs' preferences are not necessarily the preferences of its audience.